Simulation Study Setup

I simulate the data in modules (blocks). Within each module, I first generate a random standard normal vector (eigengene) and then calculate the expression levels of all other genes in a module as a function of that eigengene.

In this simulation setup I have simulated 6 modules with the following proportions and environment dependent correlations :

Blocks

  1. Turquoise (0.15). Correlated with Blue module regardless of environment
  2. Blue (0.15). Correlated with Turquoise module regardless of environment
  3. Red (0.15). Correlated with Green module AND within module, only when E=1
  4. Green (0.15). Correlated with Red module AND within module, only when E=1
  5. Yellow (0.15). Uncorrelated with any other module regardless of environment
  6. Grey (0.25). Uncorrelated within and between modules regardless of environment

Regression coefficients

I have kept the number of active variables fixed at 50. With the first 25 genes in the Green and Red modules corresponding to the active genes. All active genes have interaction terms as well. Therefore there the true model contains a 51 main effects (50 gene + 1 environment) and 50 interactions associated with the response. I generate the main effects from a \(Unif(3.9,4.1)\), the interaction effects from a \(Unif(1.9,2.1)\), and \(\beta_E = 5\).

Error Variance

Let \(Y^* = \beta_E E + \sum_j \beta_j X_j + \alpha_{jE} X_j E\). We generate a continuous response \(Y = Y^* + k \cdot \varepsilon\) where the error term \(\varepsilon\) is generated from a standard normal distribution, and \(k\) is chosen such that the signal-to-noise ratio \(\eta = \left(Var(Y^*)/Var(\varepsilon)\right)\) is 0.5 (e.g. the variance of the response variable \(Y\) due to \(\varepsilon\) is \(1/\eta\) of the variance of \(Y\) due to \(Y^*\))

Simulation parameters

I plan on varying the total number of genes (500, 1000, 3000), while keeping the proportions of each module fixed, and the number of active variables fixed. I also plan on varying the degree of correlation between the green and red modules when E=1 (0.1, 0.35, 0.75, 0.95).

Below I have plotted the heatmaps of several similarity matrices, and have labled the module that they truly belong to, and which of those genes are active. The figures below are based on

  • \(n0 = 100\) (number of subjects unexposed)
  • \(n1 = 100\) (number of subjects exposed)
  • \(p = 500\)
  • \(rho = 0.90\) (correlation between red and green modules when E=1)

Heatmaps of Correlations

\(\rho_{All}\)

\(\rho_{E=0}\)

\(\rho_{E=1}\)

\(S_{pearson} = |\rho_{E=1} - \rho_{E=0}|\)

\(S = |\rho_{E=0} + \rho_{E=1} - 2 \rho|\)

This is one of the Parmigianni 2005 scores, defined as the gap/substitution score. This score finds gene pairs that jointly discriminate the two environments. We choose \(\alpha = 2\) because then the score is proportional to \((\rho_1+\rho_2)/2 - \rho\), i.e., the difference between the average of the conditional correlations, and the combined correlations. (In a later paper, the authors suggested 2 instead of the 1.5). By using 2, we now see that the yellow, blue and turquoise model are not highlighted. This score is meant to capture situations such as the on shown in the Artificial Example 1 in the figure below. That is, two genes show a pronounced joint association on the phenotype: if the sum of their expression levels exceeds 3 units, only the blue-triangle phenotype is observed. Neither of the two genes shows a strong association with the phenotype in the univariate marginal distribution.

In their paper they also suggest the difference of class conditional spearman correlations: \(|\rho_{E=0} + \rho_{E=1}|\) which captures the Artificial Example 2. If both genes are off (expression values below 1.5 units), or both genes are on (expression value above 1.5 units), we observe the red-circle phenotype. In contrast, if only one of the genes is turned on, the blue-triangle phenotype is predominant.

\(S_{spearman} = |\rho_{E=1} - \rho_{E=0}|\)

\(\rho_{E=1}\) and \(\rho_{E=0}\) are the class conditional Spearman correlations. This score is also proposed in Parmigianni 2005. We see that this score better captures the genes of interest, with scores greater than 1.

Fisher’s Z Transformation

Let \(\rho_{ijk}\) be the correlation between genes \(i\) and \(j\) in class \(k\). Fisher’s Z transformation first involves transforming the correlations into z values: \(z_{ijk} = 0.5 log | (1 + \rho_{ijk})/(1- \rho_{ijk}) |\). The Z-test statistic is given by \(|z_{ij0}-z_{ij1}|/\sqrt{1/(n_0-3) + 1/(n1-3)} \sim \mathcal{N}(0,1)\). This assumes genes \(i\) and \(j\) follow a bivariate normal distribution.